We propose a distributed quantum computing (DQC) architecture in which individual small quantum computers are connected to a shared quantum gate processing unit (S-QGPU). The S-QGPU comprises a collection of hybrid two-qubit gate modules for remote gate operations. In contrast to conventional DQC systems, where each quantum computer is equipped with dedicated communication qubits, the S-QGPU pools these resources (e.g., the communication qubits) for remote gate operations, significantly reducing the cost of both the local quantum computers and the overall distributed system. Our preliminary analysis and simulation show that the S-QGPU's shared resources for remote gate operations enable efficient resource utilization. When not all computing qubits (also called data qubits) in the system require simultaneous remote gate operations, an S-QGPU-based DQC architecture demands fewer communication qubits, further decreasing the overall cost. Alternatively, with the same number of communication qubits, it can support a larger number of simultaneous remote gate operations more efficiently, especially when these operations occur in bursts.
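To make the pooling argument concrete, here is a minimal Monte Carlo sketch in Python (illustrative only, not the paper's simulation). It compares the worst-case provisioning a dedicated-per-node design must pay against the concurrent demand a shared pool actually has to cover. The node count, per-slot request probability, and coverage target are assumed values, requests are modeled as independent Bernoulli trials rather than the bursty traffic the abstract mentions, and each "remote-gate slot" stands for a fixed number of communication qubits, so comparing slot counts compares communication-qubit counts up to a constant.

```python
import random

# Minimal Monte Carlo sketch of the pooling argument above (illustrative
# only; all parameters are assumptions, not values from the paper).
# In each time slot, every node independently requests a remote two-qubit
# gate with probability P_REQUEST. A conventional design dedicates
# remote-gate resources (communication qubits) to every node, so it must
# provision for the worst case; an S-QGPU-style shared pool only has to
# cover the number of *concurrent* requests, which is usually far smaller.

NUM_NODES = 16       # small quantum computers in the system (assumed)
P_REQUEST = 0.2      # per-slot chance a node needs a remote gate (assumed)
TARGET = 0.99        # fraction of slots the shared pool should serve fully
TRIALS = 100_000

def concurrent_requests() -> int:
    """Remote-gate requests issued across all nodes in one time slot."""
    return sum(random.random() < P_REQUEST for _ in range(NUM_NODES))

def simulate() -> None:
    demand = sorted(concurrent_requests() for _ in range(TRIALS))
    average = sum(demand) / TRIALS
    covered = demand[int(TARGET * TRIALS)]  # demand met in TARGET of slots
    print(f"dedicated design provisions : {NUM_NODES} remote-gate slots (one per node)")
    print(f"average concurrent demand   : {average:.2f} remote-gate slots")
    print(f"pool covering {TARGET:.0%} of slots : {covered} remote-gate slots")

if __name__ == "__main__":
    random.seed(0)
    simulate()
```

Running this sketch prints the one-slot-per-node provisioning of the dedicated design next to the much smaller pool size that covers 99% of time slots, which is the qualitative gap the abstract attributes to resource pooling.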
Deploying Deep Learning Recommendation Models (DLRMs) involves parallelizing extra-large embedding tables (EMTs) across multiple GPUs. Existing works overlook the input-dependent behavior of EMTs and parallelize them in a coarse-grained manner, resulting in unbalanced workload distribution and excessive inter-GPU communication. To this end, we propose OPER, an algorithm-system co-design with OPtimality-guided Embedding table parallelization for large-scale Recommendation model training and inference. The core idea of OPER is to explore the connection between DLRM inputs and the efficiency of distributed EMTs in order to provide a near-optimal parallelization strategy for EMTs. Specifically, we conduct an in-depth analysis of various types of EMT parallelism and propose a heuristic search algorithm that efficiently approximates an empirically near-optimal EMT parallelization. Furthermore, we implement a distributed, shared-memory-based system that supports the lightweight yet complex computation and communication patterns of fine-grained EMT parallelization, effectively converting theoretical improvements into real speedups. Extensive evaluation shows that OPER achieves average speedups of 2.3× in training and 4.0× in inference over state-of-the-art DLRM frameworks.
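To make the "input-dependent" point concrete, below is a minimal greedy placement sketch in Python. It is not OPER's heuristic search or its distributed shared-memory system; it only illustrates the kind of decision an input-aware strategy makes, by taking hypothetical per-table lookup counts from profiled inputs and packing the hottest tables onto the least-loaded GPU, something a size-only, coarse-grained partition cannot do. The table names and counts are made up for illustration.

```python
from heapq import heapify, heappop, heappush

# Toy, input-aware embedding-table placement (illustrative only; NOT
# OPER's actual algorithm). Tables are assigned to GPUs greedily, heaviest
# first, always onto the currently least-loaded GPU, where "load" is the
# number of lookups observed for that table in a sample of inputs.

def balance_tables(access_counts: dict[str, int], num_gpus: int) -> list[list[str]]:
    """Assign each embedding table to a GPU, balancing estimated lookup load."""
    heap = [(0, g) for g in range(num_gpus)]   # (current load, gpu index)
    heapify(heap)
    placement: list[list[str]] = [[] for _ in range(num_gpus)]
    for table, load in sorted(access_counts.items(), key=lambda kv: -kv[1]):
        current, gpu = heappop(heap)           # least-loaded GPU so far
        placement[gpu].append(table)
        heappush(heap, (current + load, gpu))
    return placement

if __name__ == "__main__":
    # Hypothetical per-table lookup counts from profiling a batch of inputs.
    counts = {"emb_user": 9_000, "emb_item": 7_500, "emb_cat": 1_200,
              "emb_geo": 900, "emb_device": 400, "emb_hour": 300}
    for gpu, tables in enumerate(balance_tables(counts, num_gpus=2)):
        load = sum(counts[t] for t in tables)
        print(f"GPU {gpu}: {tables} (estimated lookups: {load})")
```

A real strategy such as OPER also has to account for the inter-GPU communication that each placement induces, which this toy balancer ignores.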